Abstract

In this paper, we focus on two key aspects of the multiple
target tracking problem: 1) designing an accurate affinity
measure to associate detections and 2) implementing an
efficient and accurate (near) online multiple target tracking
algorithm. As the first contribution, we introduce a novel
Aggregated Local Flow Descriptor (ALFD) that encodes
the relative motion pattern between a pair of temporally
distant detections using long-term interest point trajectories
(IPTs). Leveraging the IPTs, the ALFD provides a robust
affinity measure for estimating the likelihood of matching
detections regardless of the application scenario. As another
contribution, we present a Near-Online Multi-target
Tracking (NOMT) algorithm. The tracking problem is
formulated as a data association between targets and
detections in a temporal window, which is performed repeatedly
at every frame. While being efficient, NOMT achieves
robustness by integrating multiple cues, including the ALFD
metric, target dynamics, appearance similarity, and long-term
trajectory regularization, into the model. Our ablative
analysis verifies the superiority of the ALFD metric over
other conventional affinity metrics. We run a comprehensive
experimental evaluation on two challenging tracking
datasets, the KITTI [15] and MOT [1] datasets. The NOMT
method combined with the ALFD metric achieves the best
accuracy on both datasets by significant margins (about 10%
higher MOTA) over the state of the art.

1. Introduction 

The goal of multiple target tracking is to automatically
identify objects of interest and reliably estimate the motion
of targets over time. Thanks to the recent advancement
in image-based object detection methods [9, 12, 16, 32],
tracking-by-detection [3, 5, 10, 23, 25] has become a popular
framework to tackle the multiple target tracking problem.
The advantages of the framework are that it naturally
identifies new objects of interest entering the scene, that
it can handle video sequences recorded using mobile
platforms, and that it is robust to target drift. The key
challenge in this framework is to accurately group the
detections into individual targets (data association), so that
each target is fully represented by a single estimated
trajectory. Mistakes made in identity maintenance could
result in catastrophic failure in many high-level reasoning
tasks, such as future motion prediction, target behavior
analysis, etc.

Figure 1. Bounding box distance and appearance similarity are
popular affinity metrics in the multiple target tracking
literature. However, in real-world crowded scenes, they are often
too ambiguous to distinguish adjacent or similar-looking
targets. In contrast, optical flow trajectories provide a more
reliable measure to compare different detections across time.
Although an individual trajectory may be inaccurate (red line),
collectively they provide strong information to measure the
affinity. We propose a novel Aggregated Local Flow Descriptor
that exploits the optical flow reliably in the multiple target
tracking problem. The figure is best viewed in color.

To implement a highly accurate multiple target tracking
algorithm, it is important to have a robust data association
model and an accurate measure to compare two detections
across time (pairwise affinity measure). Recently, much
work has been done on the design of data association
algorithms using the global (batch) tracking framework
[3, 23, 25, 35]. Compared to their online counterparts
[5, 7, 10, 20], these methods have the benefit of considering
all the detections over the entire sequence. With the help of
clever optimization algorithms, they achieve higher data
association accuracy than traditional online tracking
frameworks. However, the application of these methods is
fundamentally limited to post-analysis of video sequences. On
the other hand, the pairwise affinity measure is relatively
less investigated in the recent literature despite its
importance. Most methods adopt weak affinity measures (see
Fig. 1) to compare two detections across time, such as spatial
affinity (e.g. bounding box overlap or Euclidean distance
[2, 3, 28]) or simple appearance similarity (e.g. intersection
kernel with color histogram [29]). In this paper, we address
two key challenging questions of the multiple target tracking
problem: 1) how to accurately measure the pairwise affinity
between two detections (i.e. the likelihood of linking the two)
and 2) how to efficiently apply the ideas of global tracking
algorithms in an online application.

Figure 2. Our NOMT algorithm solves the global association
problem at every time frame t with a temporal window r. Solid
circles represent associated targets, dashed circles represent
unobserved detections, dashed lines show finalized target
associations before the temporal window, and solid lines
represent the (active) associations made in the current time
frame. Due to the limited amount of observation, the tracking
algorithm may produce an erroneous association at $t_2$. But
once more observations are provided at $t_3$, our algorithm is
capable of fixing the error made at $t_2$. In addition, our
method automatically identifies new targets on the fly (red
circles). The figure is best viewed in color.

As the first contribution, we present a novel Aggregated
Local Flow Descriptor (ALFD) that encodes the relative
motion pattern between two detection boxes in different
time frames (Sec. 3). By aggregating multiple local interest
point trajectories (IPTs), the descriptor encodes how the
IPTs in a detection move with respect to another detection
box, and vice versa. The main intuition is that although each
individual IPT may be erroneous, collectively they provide
strong information for comparing two detections. With a
learned model, we observe that the ALFD provides a strong
affinity measure, and thereby strong cues for the association
algorithm.

As the second contribution, we propose an efficient
Near-Online Multi-target Tracking (NOMT) algorithm.
Incorporating the robust ALFD descriptor as well as long-term
motion/appearance models motivated by the success
of modern batch tracking methods, the algorithm produces
highly accurate trajectories while preserving the causality
property and running in real time (~10 FPS). In every time
frame t, the algorithm solves the global data association
problem between targets and all the detections in a temporal
window [t-r, t] of size r (see Fig. 2). The key property
is that the algorithm is able to fix any association error
made in the past when more detections are provided. In order
to achieve both accuracy and efficiency, the algorithm
generates candidate hypothetical trajectories using
ALFD-driven tracklets and solves the association problem with
a parallelized junction tree algorithm (Sec. 4).

We perform a comprehensive experimental evaluation
on two challenging datasets: the KITTI [15] and MOT
Challenge [1] datasets. The proposed algorithm achieves the
best accuracy by a large margin over the state of the art
(including batch algorithms) on both datasets, demonstrating
the superiority of our algorithm. The rest of the paper is
organized as follows. Sec. 2 discusses the background and
related work in the multiple target tracking literature. Sec. 3
describes our newly proposed ALFD. Sec. 4 presents an overview
of the NOMT data association model and the algorithm. Sec. 5
discusses the details of the model design. We show the analysis
and experimental evaluation in Sec. 6, and finally conclude
in Sec. 7.

2. Background 

Given a video sequence $V_1^T = \{I_1, I_2, \ldots, I_T\}$ of length
$T$ and a set of detection hypotheses $D_1^T = \{d_1, d_2, \ldots\}$,
where $d_i$ is parameterized by the frame number $t_i$, a bounding
box $(d_i[x], d_i[y], d_i[w], d_i[h])$ (1), and the score $s_i$, the
goal of multiple target tracking is to find a coherent set of
targets (associations) $A = \{A_1, A_2, \ldots, A_M\}$, where each
target $A_m$ is parameterized by a set of detection indices
(e.g. $A_1 = \{d_1, d_{10}, d_{23}\}$) during the time of presence; i.e.
$(V_1^T, D_1^T) \rightarrow A^T$.

(1) The $[x]$, $[y]$, $[w]$, $[h]$ operators represent the x, y, width,
and height values, respectively.

2.1. Data Association Models 

Most multiple target tracking algorithms/systems can
be classified into two categories: online methods and global
(batch) methods.

Online algorithms [5, 7, 10, 20, 27] are formulated to
find the association between existing targets and detections
in the current time frame: $(V_t^t, D_t^t, A^{t-1}) \rightarrow A^t$. The
advantages of the online formulation are: 1) it is applicable to
online/real-time scenarios and 2) it is possible to take
advantage of the targets' dynamics information available in
$A^{t-1}$. Such methods, however, are often prone to association
errors since they consider only one frame when making the
association. Solving the problem based on (temporally) local
information can fundamentally limit the association accuracy.
To avoid such errors, [5] adopts a conservative association
threshold together with detection confidence maps,
and [7, 20, 27] model interactions between targets.

Recently, global algorithms [2, 3, 25, 28, 35] have become
much more popular in the community, as more robust
association is achieved when considering long-term information in
the association process. One common approach is to
formulate tracking as a network flow problem to directly
obtain the targets from the detection hypotheses [3, 28, 35]; i.e.
$(V_1^T, D_1^T) \rightarrow A^T$. Although they have shown promising
accuracy in multiple target tracking, these methods are often
over-simplified for the sake of tractability. They ignore
useful target-level information, such as target dynamics
and interaction between targets (occlusion, social
interaction, etc.). Instead of directly solving the problem in
one step, others employ an iterative algorithm that
progressively refines the target association [2, 18, 23, 25]; i.e.
$(V_1^T, D_1^T, A_i^T) \rightarrow A_{i+1}^T$, where $i$ represents an iteration.
Starting from short trajectories (tracklets), [18, 23] associate
them into longer targets in a hierarchical fashion. [2, 25]
iterate between two modes, association and continuous
estimation. Since these methods obtain intermediate target
information, the targets' dynamics, interactions, and high-order
statistics on the trajectories can be accounted for, which can
lead to better association accuracy. However, it is unclear how
to seamlessly extend such models to an online application.

We propose a novel framework that fills the gap between
the online and global algorithms. The task is defined
as solving the following problem:
$(V_1^t, D_{t-r}^t, A^{t-1}) \rightarrow A^t$
in each time frame t, where r is the pre-defined temporal
window size. Our algorithm behaves similarly to an online
algorithm in that it outputs the association in every time frame.
The critical difference is that any decision made in the
past is subject to change once more observations are
available. The association problem in each temporal window
is solved using a newly proposed global association
algorithm. Our method is also reminiscent of iterative global
algorithms, since we augment all the tracks iteratively (one
iteration per frame) considering multiple frames, which leads
to better association accuracy.

2.2. Affinity Measures in Visual Tracking 

A robust pairwise affinity measure (i.e. the
likelihood of $d_i$ and $d_j$ being the same target) is relatively
less investigated in the multi-target tracking literature
despite its importance. Most of the recent literature
[2, 3, 28, 29] employs a spatial distance and/or an appearance
similarity with simple features (such as color histograms). In
order to learn a discriminative affinity metric, Kuo et al. [23]
introduce online appearance learning with a boosting
algorithm using various feature inputs such as HoG [8], texture
features, and RGB color histograms. Milan et al. [25] and
Zamir et al. [29] proposed to use a global appearance
consistency measure to ensure that a target has a similar (or
smoothly varying) appearance over the long term. Although
there have been many works exploiting appearance information
or spatial smoothness, we are not aware of any work employing
optical flow trajectories to define a likelihood of matching
detections. Recently, Fragkiadaki et al. [13] introduced a
method to track multiple targets while jointly clustering
optical flow trajectories. The work presents a promising
result, but the model is complicated due to the joint inference
on both target and flow level association. In contrast, our
ALFD provides a strong pairwise affinity measure that is
generally applicable in any tracking model.

3. Aggregated Local Flow Descriptor

The Aggregated Local Flow Descriptor (ALFD) encodes
the relative motion pattern between two bounding boxes at
a temporal distance ($\Delta t = |t_i - t_j|$) given interest point
trajectories [31]. The main intuition in ALFD is that if the
two boxes belong to the same target, we shall observe many
supporting IPTs in the same relative location with respect
to the boxes. In order to make it robust against small
localization errors in detections, changes in the targets'
orientation, and outliers/errors in the IPTs, we build the ALFD
using spatial histograms. Once the ALFD is obtained, we
measure the affinity between two detections using the linear
product of a learned model parameter $w_{\Delta t}$ and the ALFD, i.e.
$a_A(d_i, d_j) = w_{\Delta t} \cdot \rho(d_i, d_j)$. In the following
subsections, we discuss the details of the design.

3.1. Interest Point Trajectories 

We obtain Interest Point Trajectories using a local interest
point detector [4, 30] and an optical flow algorithm [4, 11].
The algorithm is designed to produce a set of long and
accurate point trajectories, combining various well-known
computer vision techniques. Given an image $I_t$, we run
the FAST interest point detector [4, 30] to identify "good
points" to track. In order to avoid having redundant points,
we compute the distance between the newly detected interest
points and the existing IPTs and keep only the new points that
are sufficiently far from the existing IPTs (> 4 px). The new
points are assigned unique IDs. For all the IPTs in t, we
compute the forward ($t \rightarrow t+1$) and backward ($t+1 \rightarrow t$)
optical flow using [4, 11]. The starting points of the backward
flows are given by the forward flows' end points. Any IPT
having a large disagreement between the two (> 10 px) is
terminated.

Figure 3. Illustrative figure for unidirectional ALFDs $\rho'(d_i, d_j)$.
In the top figure, we show detections as colored bounding boxes
($d_{red}$, $d_{blue}$, and $d_{green}$). A pair of circles with connecting
lines represents an IPT that exists in both $t$ and $t + \Delta t$ and is
located inside of $d_{red}$ at $t$. We draw the accurate (green), outlier
(black), and erroneous (red) IPTs. In the bottom figure, we show
two exemplar unidirectional ALFDs $\rho'$ for ($d_{red}$, $d_{blue}$) and
($d_{red}$, $d_{green}$). The red grids (2 x 2) represent the IPTs' location
at $t$ relative to $d_{red}$. The blue and green grids inside each red bin
(2 x 2 + 2 external bins) show the IPTs' location at $t + \Delta t$ relative
to the corresponding boxes. IPTs in the grid bins with a red box
are the ones observed in the same relative location. Intuitively, the
more IPTs are observed in these bins, the more likely the two
detections belong to the same target. In contrast, wrong matches will
have more support in the outside bins. The illustration is shown
using 2 x 2 grids to avoid clutter. We use 4 x 4 in practice. The
figure is best viewed in color.

Figure 4. Visualization of two learned model weights, $w_{\Delta 1}$
(left panel, $\Delta t = 1$) and $w_{\Delta 20}$ (right panel, $\Delta t = 20$).
Having a higher $\rho$ value in the bright (white) bins yields
a higher affinity measure. As the temporal distance increases, the
model weights tend to spread out to the adjacent bins to account
for possible changes in the targets' orientation and higher IPT
errors.
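The two filtering rules of Sec. 3.1 (reject new interest points within 4 px of an existing IPT, and terminate an IPT whose forward and backward flows disagree by more than 10 px) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `flow_fwd` and `flow_bwd` are hypothetical callables standing in for whatever optical flow routine is used, and the function names are our own.

```python
import math


def is_redundant(candidate, existing_points, min_dist=4.0):
    """True if a newly detected point lies within min_dist px of any existing IPT."""
    return any(
        math.hypot(candidate[0] - p[0], candidate[1] - p[1]) <= min_dist
        for p in existing_points
    )


def forward_backward_step(point, flow_fwd, flow_bwd, max_error=10.0):
    """Advance one IPT by one frame and check forward/backward agreement.

    flow_fwd and flow_bwd are hypothetical callables that map an (x, y)
    position to a displacement (dx, dy) for t -> t+1 and t+1 -> t.
    Returns (new_point, keep); keep is False when the IPT should be terminated.
    """
    x, y = point
    dx, dy = flow_fwd((x, y))
    end = (x + dx, y + dy)                 # end point of the forward flow
    bx, by = flow_bwd(end)                 # backward flow starts at that end point
    back = (end[0] + bx, end[1] + by)      # ideally lands on the original point
    error = math.hypot(back[0] - x, back[1] - y)
    return end, error <= max_error
```

With a perfectly consistent flow pair the round trip returns to the starting point and the trajectory is kept; a large round-trip error marks the trajectory for termination.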


3.2. ALFD Design 

Let us define the necessary notation to discuss the ALFD.
$\kappa_{id} \in \mathcal{K}$ represents an IPT with a unique id that is
parameterized by pixel locations $(\kappa_{id}(t)[x], \kappa_{id}(t)[y])$ during
the time of presence. $\kappa_{id}(t)$ denotes the pixel location at
frame $t$. If $\kappa_{id}$ does not exist at $t$ (terminated or not
initiated), $\emptyset$ is returned.

We first define a unidirectional ALFD $\rho'(d_i, d_j)$, i.e.
from $d_i$ to $d_j$, by aggregating the information from all the
IPTs that are located inside the $d_i$ box and exist at $t_j$.
Formally, we define the IPT set as
$\mathcal{K}(d_i, d_j) = \{\kappa_{id} \mid \kappa_{id}(t_i) \in d_i \;\&\; \kappa_{id}(t_j) \neq \emptyset\}$.
For each $\kappa_{id} \in \mathcal{K}(d_i, d_j)$, we compute the relative
location $r_i(\kappa_{id}) = (x, y)$ of each $\kappa_{id}$ at $d_i$ by
$r_i(\kappa_{id})[x] = (\kappa_{id}(t_i)[x] - d_i[x]) / d_i[w]$ and
$r_i(\kappa_{id})[y] = (\kappa_{id}(t_i)[y] - d_i[y]) / d_i[h]$.
We compute $r_j(\kappa_{id})$ similarly.
Notice that the $r_i(\kappa_{id})$ are bounded within [0, 1], but the
$r_j(\kappa_{id})$ are not bounded since $\kappa_{id}$ can be outside of $d_j$.
Given $r_i(\kappa_{id})$ and $r_j(\kappa_{id})$, we compute the corresponding
spatial grid bin indices as shown in Fig. 3 and accumulate the
count to build the descriptor. We define 4 x 4 grids for
$r_i(\kappa_{id})$ and 4 x 4 + 2 grids for $r_j(\kappa_{id})$, where the last 2
bins account for the region outside of the detection. The
first outside bin defines the neighborhood of the detection
(< width/4 & < height/4), and the second outside bin
represents any farther region.

Using a pair of unidirectional ALFDs, we define the
ALFD as $\rho(d_i, d_j) = (\rho'(d_i, d_j) + \rho'(d_j, d_i)) / n(d_i, d_j)$,
where $n(d_i, d_j)$ is a normalizer. The normalizer $n$ is defined
as $n(d_i, d_j) = |\mathcal{K}(d_i, d_j)| + |\mathcal{K}(d_j, d_i)| + \lambda$, where
$|\mathcal{K}(\cdot)|$ is the count of IPTs and $\lambda$ is a constant. $\lambda$
ensures that the L1 norm of the ALFD increases as we have more
supporting $\mathcal{K}(d_i, d_j)$ and converges to 1. We use $\lambda = 20$
in practice.
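To make the construction concrete, the sketch below builds the descriptor from a list of IPTs. It is an illustration under stated assumptions, not the authors' code: boxes are `(x, y, w, h)` tuples with `(x, y)` the top-left corner, each IPT is a dict mapping a frame number to a pixel location, the "neighborhood" bin is interpreted as lying within a quarter of the box size outside the box border, and all function names are hypothetical.

```python
GRID = 4        # 4 x 4 inner grid, as stated in the text
LAMBDA = 20.0   # normalizer constant from the text
N_IN = GRID * GRID
N_OUT = GRID * GRID + 2   # 4 x 4 cells plus two outside bins


def relative_location(point, box):
    """Location of a point relative to a box (x, y, w, h); [0, 1] means inside."""
    x, y, w, h = box
    return ((point[0] - x) / w, (point[1] - y) / h)


def is_inside(r):
    return 0.0 <= r[0] <= 1.0 and 0.0 <= r[1] <= 1.0


def inner_bin(r):
    col = min(int(r[0] * GRID), GRID - 1)
    row = min(int(r[1] * GRID), GRID - 1)
    return row * GRID + col


def outer_bin(r):
    """Bin index in the second box: a grid cell, 'near outside', or 'far outside'."""
    if is_inside(r):
        return inner_bin(r)
    # Assumed reading of "< width/4 & < height/4": within a quarter box size of the border.
    near = -0.25 <= r[0] <= 1.25 and -0.25 <= r[1] <= 1.25
    return N_IN + (0 if near else 1)


def unidirectional_alfd(box_i, t_i, box_j, t_j, ipts):
    """Histogram rho'(d_i, d_j) and the number of supporting IPTs."""
    hist = [0.0] * (N_IN * N_OUT)
    count = 0
    for traj in ipts:
        if t_i not in traj or t_j not in traj:
            continue                      # the IPT must exist in both frames
        r_i = relative_location(traj[t_i], box_i)
        if not is_inside(r_i):
            continue                      # the IPT must start inside d_i
        r_j = relative_location(traj[t_j], box_j)
        hist[inner_bin(r_i) * N_OUT + outer_bin(r_j)] += 1.0
        count += 1
    return hist, count


def alfd(box_i, t_i, box_j, t_j, ipts):
    """Bidirectional descriptor rho(d_i, d_j) with the lambda normalizer."""
    h_ij, n_ij = unidirectional_alfd(box_i, t_i, box_j, t_j, ipts)
    h_ji, n_ji = unidirectional_alfd(box_j, t_j, box_i, t_i, ipts)
    norm = n_ij + n_ji + LAMBDA
    return [(a + b) / norm for a, b in zip(h_ij, h_ji)]


def affinity(weights, descriptor):
    """Linear affinity: dot product of learned weights and the descriptor."""
    return sum(w * v for w, v in zip(weights, descriptor))
```

Because of the additive constant in the normalizer, a pair supported by only a handful of IPTs yields a descriptor with a small L1 norm, so it contributes a weak affinity in either direction.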

3.3. Learning the Model Weights 

We learn the model parameters $w_{\Delta t}$ from a training
dataset with weighted voting. Given a set of detections $D_1^T$
and corresponding ground truth (GT) target annotations, we
first assign a GT target id to each detection. For each
detection $d_i$, we measure the overlap with all the GT boxes in
$t_i$. If the best overlap $o_i$ is larger than 0.5, the
corresponding target id ($id_i$) is assigned. Otherwise, $-1$ is
assigned. For all detections that have $id_i \geq 0$ (positive
detections), we collect a set of detections
$D_i^{\Delta t} = \{d_j \in D_1^T \mid t_j - t_i = \Delta t\}$.
For each pair, we compute the margin $m_{ij}$ as follows: if $id_i$
and $id_j$ are identical, $m_{ij} = (o_i - 0.5) \cdot (o_j - 0.5)$.
Otherwise, $m_{ij} = -(o_i - 0.5) \cdot (o_j - 0.5)$. Intuitively, $m_{ij}$
has a positive value if the two detections are from the same
target, and a negative value if $d_i$ and $d_j$ are from different
targets. The magnitude is weighted by the localization accuracy.
Given all the pairs and margins, we learn the model $w_{\Delta t}$ as
follows:

$w_{\Delta t} = \frac{\sum_{\{i \mid id_i \geq 0\}} \sum_{j \in D_i^{\Delta t}} m_{ij} \, (\rho'(d_i, d_j) + \rho'(d_j, d_i))}{\sum_{\{i \mid id_i \geq 0\}} \sum_{j \in D_i^{\Delta t}} |m_{ij}| \, (\rho'(d_i, d_j) + \rho'(d_j, d_i))}$    (1)

The algorithm computes a signed weighted average
over all the ALFD patterns, where the weights are
determined by the overlap between targets and detections.
Intuitively, an ALFD pattern between detections that match
well with the GT contributes more to the model parameters.
The advantage of the weighted voting method is
that each element in $w_{\Delta t}$ is bounded in $[-1, 1]$; thus the
ALFD metric, $a_A(d_i, d_j)$, is also bounded in $[-1, 1]$ since
$\|\rho(d_i, d_j)\|_1 \leq 1$. Fig. 4 shows two models learned using
our method. One can adopt alternative learning algorithms
such as SVM [6].
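The weighted voting rule can be sketched as an element-wise ratio of two accumulated histograms. This is a minimal sketch under assumptions of our own: each training sample is a `(margin, descriptor)` pair where the descriptor is the unnormalized sum of the two unidirectional ALFDs, bins that never receive any support are set to 0 (the text does not say how empty bins are handled), and the function names are hypothetical.

```python
def margin(same_target, o_i, o_j):
    """Signed margin m_ij from the GT overlaps o_i and o_j of two positive detections."""
    m = (o_i - 0.5) * (o_j - 0.5)
    return m if same_target else -m


def learn_weights(samples):
    """Weighted voting over (margin, descriptor) pairs.

    Each descriptor holds non-negative bin counts, so every learned weight
    ends up in [-1, 1]: the numerator is a signed sum and the denominator
    is the same sum with absolute margins.
    """
    dim = len(samples[0][1])
    numerator = [0.0] * dim
    denominator = [0.0] * dim
    for m, descriptor in samples:
        for k, value in enumerate(descriptor):
            numerator[k] += m * value
            denominator[k] += abs(m) * value
    # Assumption: bins with no support get weight 0.
    return [n / d if d > 0 else 0.0 for n, d in zip(numerator, denominator)]
```

A bin that is populated only by same-target pairs converges to +1, a bin populated only by different-target pairs converges to -1, and mixed bins land in between according to how well localized the voting detections are.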

3.4. Properties 

In this section, we discuss the properties of the ALFD
affinity metric $a_A(d_i, d_j)$. Firstly, unlike appearance or spatial
metrics, the ALFD implicitly exploits the information in all the
images between $t_i$ and $t_j$ through IPTs. Secondly, thanks
to the collective nature of the ALFD design, it provides a strong
affinity metric over an arbitrary length of time. We observe a
significant benefit over the appearance or spatial metrics,
especially over a long temporal distance (see Sec. 6.1 for the
analysis). Thirdly, it is generally applicable to any scenario
(either static or moving camera) and to any object type
(person or car). A disadvantage of the ALFD is that it may
become unreliable when there is an occlusion. When a target
is occluded, the IPTs initiated from the target
tend to adhere to the occluder. This motivates us to combine it
with the target dynamics information discussed in Sec. 5.1.

4. Near Online Multi-target Tracking (NOMT) 

We employ a near-online multi-target tracking framework
that updates and outputs targets in each time frame
considering inputs in a temporal window [t-r, t]. We
implement the NOMT algorithm with a hypothesis generation
and selection scheme. For the convenience of discussion,
we define clean targets $A^{*t-1} = \{A_1^{*t-1}, A_2^{*t-1}, \ldots\}$ that
exclude all the associated detections in [t-r, t-1]. Given
a set of detections in [t-r, t] and clean targets $A^{*t-1}$, we
generate multiple target hypotheses
$H_m^t = \{H_{m,1}^t = \emptyset, H_{m,2}^t, H_{m,3}^t, \ldots\}$ for each target
$A_m^{*t-1}$ as well as for newly entering targets, where $\emptyset$ (empty
hypothesis) represents the termination of the target and each
$H_{m,k}^t$ indicates a set of candidate detections in [t-r, t] that
can be associated to a target (Sec. 4.2). Each $H_{m,k}^t$ may
contain 0 to r detections (in any one time frame, there can be 0
or 1 detection). Given the set of hypotheses for all the existing
and new targets, the algorithm finds the most consistent set of
hypotheses (MAP) for all the targets (one for each) using a
graphical model (Sec. 4.3). As the key characteristic, our
algorithm is able to fix any association error (for the
detections within the temporal window [t-r, t]) made in the
previous time frames.

4.1. Model Representation 

Before going into the details of each step, we discuss
our underlying model representation. The model is
formulated as an energy minimization framework:
$\hat{x} = \arg\min_x E(A^{*t-1}, H^t(x), D_{t-r}^t, V_1^t)$, where $x$ is an
integer state vector indicating which hypothesis is chosen for
the corresponding target, $H^t$ is the set of all the hypotheses
$\{H_1^t, H_2^t, \ldots\}$, and $H^t(x)$ is the set of selected hypotheses
$\{H_{1,x_1}^t, H_{2,x_2}^t, \ldots\}$. After solving the optimization, the
updated targets $A^t$ can be uniquely identified by augmenting
$A^{*t-1}$ with the selected hypotheses $H^t(\hat{x})$. Hereafter, we
drop $V_1^t$ and $D_{t-r}^t$ to avoid clutter in the equations. The
energy is defined as follows:

$E(A^{*t-1}, H^t(x)) = \sum_{m \in A^{*t-1}} \Psi(A_m^{*t-1}, H_{m,x_m}^t) + \sum_{(m,l) \in A^{*t-1}} \Phi(H_{m,x_m}^t, H_{l,x_l}^t)$    (2)

where $\Psi(\cdot)$ encodes an individual target's motion, appearance,
and ALFD metric consistency, and $\Phi(\cdot)$ represents an exclusive
relationship between different targets (e.g. no two targets
share the same detection). If there are hypotheses for
newly entering targets, we define the corresponding target
as an empty set, $A_m^{*t-1} = \emptyset$.

Single Target Consistency 

The potential $\Psi$ measures the compatibility of a hypothesis
$H_{m,x_m}^t$ to a target $A_m^{*t-1}$. Mathematically, this can be
decomposed into unary, pairwise, and high-order terms as
follows:

$\Psi(A_m^{*t-1}, H_{m,x_m}^t) = \sum_{i \in H_{m,x_m}^t} \psi_u(A_m^{*t-1}, d_i) + \sum_{(i,j) \in H_{m,x_m}^t} \psi_p(d_i, d_j) + \psi_h(A_m^{*t-1}, H_{m,x_m}^t)$    (3)

$\psi_u$ encodes the compatibility of each detection $d_i$ in the
target hypothesis $H_{m,x_m}^t$ using the ALFD affinity metric and
the Target Dynamics feature (Sec. 5.1). $\psi_p$ measures the
pairwise compatibility (self-consistency of the hypothesis)
between detections within $H_{m,x_m}^t$ (Sec. 5.2) using the ALFD
metric. Finally, $\psi_h$ implements a long-term smoothness
constraint and appearance consistency (Sec. 5.3).

Figure 5. Schematic illustration of the NOMT algorithm. (a) Given a
set of existing targets $A^{t-1}$ and detections $D_{t-r}^t$, (b) our
method generates a set of candidate hypotheses $H^t$ using tracklets
$\mathcal{T}$. Constructing a CRF model with the hypotheses, (c) we select
the most consistent solution $\hat{x}$ using our inference algorithm and
(d) output targets $A^t$ are obtained by augmenting the previous
targets $A^{*t-1}$ with the solution $H^t(\hat{x})$. See text for the details.

Mutual Exclusion 

This potential penalizes choosing two targets with large
overlap in the image plane (repulsive force) as well as
duplicate assignments of a detection. Instead of using "hard"
exclusion constraints as in the Hungarian Algorithm [22],
we use a "soft" cost function for flexibility and computational
simplicity. If the single target consistency is strong enough,
the soft penalization cost can be overcome. Also, this
formulation makes it possible to reuse the popular graph
inference algorithms discussed in Sec. 4.3. The potential can be
written as follows:

$\Phi(H_{m,x_m}^t, H_{l,x_l}^t) = \sum_{f=t-r}^{t} \alpha \cdot o^2(d(H_{m,x_m}^t, f), d(H_{l,x_l}^t, f)) + \beta \cdot \mathbb{I}(d(H_{m,x_m}^t, f), d(H_{l,x_l}^t, f))$    (4)

where $d(H_{m,x_m}^t, f)$ gives the associated detection of
$H_{m,x_m}^t$ at time $f$ (if none, $\emptyset$ is returned),
$o^2(d_i, d_j) = 2 \cdot IoU(d_i, d_j)^2$, and $\mathbb{I}(d_i, d_j)$ is an
indicator function. The former penalizes having too much overlap
between hypotheses and the latter penalizes duplicate assignments
of detections. We use $\alpha = 0.5$ and $\beta = 100$ (large enough to
avoid duplicate assignments).
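As an illustration, the soft exclusion cost between two hypotheses can be computed frame by frame as below. This is a sketch under assumptions of our own: a hypothesis is a dict mapping a frame number to a `(detection_id, box)` pair with boxes given as `(x, y, w, h)`, a missing frame plays the role of the empty detection, and the function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = min(ax + aw, bx + bw) - max(ax, bx)
    ih = min(ay + ah, by + bh) - max(ay, by)
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    return inter / (aw * ah + bw * bh - inter)


def mutual_exclusion(hyp_m, hyp_l, frames, alpha=0.5, beta=100.0):
    """Soft exclusion cost between two hypotheses over the temporal window.

    alpha and beta follow the values quoted in the text; frames is the
    list of frame numbers in the window.
    """
    cost = 0.0
    for f in frames:
        det_m, det_l = hyp_m.get(f), hyp_l.get(f)
        if det_m is None or det_l is None:
            continue                                  # no detection in this frame
        cost += alpha * 2.0 * iou(det_m[1], det_l[1]) ** 2   # overlap penalty
        if det_m[0] == det_l[0]:
            cost += beta                              # duplicate assignment penalty
    return cost
```

Since the penalty is finite, two hypotheses sharing a detection are not forbidden outright; they are simply very unlikely to be selected together unless the single-target terms outweigh the cost.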

4.2. Hypothesis Generation 

Direct optimization of the aforementioned objective
function (Eq. 2) is infeasible since the space of $H^t$ is huge in
practice. To cope with this challenge, we first propose a set
of candidate hypotheses $H_m^t$ for each target independently
(Fig. 5(b)) and find a coherent solution (MAP) using a CRF
inference algorithm (Sec. 4.3). As all the subsequent steps
depend on the generated hypotheses, it is critical to have a
comprehensive set of target hypotheses. We generate the
hypotheses of existing and new targets using tracklets. Notice
that the following steps can be done in parallel since we
generate the hypothesis set for each target independently.

Tracklet Generation 

For all the confident detections ($\forall d_i \in D_{t-r}^t$ s.t. $s_i > 0$),
we build a tracklet using the ALFD metric $a_A$. Starting
from a one-detection tracklet $\mathcal{T}_i = \{d_i\}$, we grow the tracklet
by greedily adding the best matching detection $d_k$ such that
$k = \arg\max_{k \in D_{t-r}^t \setminus \mathcal{T}_i} \max_{j \in \mathcal{T}_i} a_A(d_j, d_k)$, where
$D_{t-r}^t \setminus \mathcal{T}_i$ is the set of detections in [t-r, t] excluding
the frames already included in $\mathcal{T}_i$. If the best ALFD metric is
lower than 0.4 or $\mathcal{T}_i$ is full (has r detections), the iteration
is terminated. In addition, we also extract the residual
detections from each $A_m^{t-1}$ in [t-r, t] to obtain additional
tracklets (i.e. $\forall m$, $A_m^{t-1} \setminus A_m^{*t-1}$). Since there can be
identical tracklets, we keep only unique tracklets in the output
set $\mathcal{T}$.
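The greedy growing step can be sketched as follows. It is a minimal illustration rather than the authors' implementation: `detections` maps a hypothetical detection id to its frame number, `affinity` is any callable returning the ALFD affinity of two detection ids, and `window` is the maximum number of detections in a tracklet.

```python
def grow_tracklet(seed, detections, affinity, window, min_affinity=0.4):
    """Greedily grow a tracklet from one seed detection.

    detections: dict mapping detection id -> frame number (hypothetical layout).
    affinity:   callable (id_a, id_b) -> float, standing in for the ALFD metric.
    """
    tracklet = [seed]
    used_frames = {detections[seed]}
    while len(tracklet) < window:
        best, best_score = None, float("-inf")
        for k, frame in detections.items():
            if frame in used_frames:
                continue                  # at most one detection per frame
            score = max(affinity(j, k) for j in tracklet)
            if score > best_score:
                best, best_score = k, score
        if best is None or best_score < min_affinity:
            break                         # nothing left, or best match too weak
        tracklet.append(best)
        used_frames.add(detections[best])
    return sorted(tracklet, key=lambda d: detections[d])


def unique_tracklets(tracklets):
    """Drop identical tracklets while preserving order."""
    seen, out = set(), []
    for t in tracklets:
        key = tuple(t)
        if key not in seen:
            seen.add(key)
            out.append(t)
    return out
```

Running `grow_tracklet` once per confident detection and passing the results through `unique_tracklets` gives a candidate set in the spirit of the paragraph above; different seeds on the same object typically converge to the same tracklet, which is why de-duplication matters.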

Hypotheses for Existing Targets 

We generate a set of target hypotheses $H_m^t$ for each
existing target $A_m^{*t-1}$ using the tracklets $\mathcal{T}$. In order to
avoid having an unnecessarily large number of hypotheses, we
employ a gating strategy. For each target $A_m^{*t-1}$, we obtain a
target predictor using the least squares algorithm with a
polynomial function [24]. We vary the order of the polynomial
depending on the dataset (1 for MOT and 2 for KITTI). If
there is an overlap (IoU) larger than a certain threshold
between the prediction and the detections in the tracklet
$\mathcal{T}_i$ at any frame in [t-r, t], we add $\mathcal{T}_i$ to the hypothesis
set $H_m^t$. In practice, we use a conservative threshold of 0.1 to
have a rich set of hypotheses. Targets that are too old (having
no associated detection in $[t-r-T_{active}, t]$) are ignored to
avoid unnecessary computational burden. We use
$T_{active} = 1$ sec.
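The gating test can be illustrated with a first-order (linear) least-squares predictor, which corresponds to the order used for MOT; a second-order fit would be needed for the KITTI setting. This is a sketch under assumptions of our own: each box coordinate is fitted independently against the frame number, a target history is a list of `(frame, box)` pairs with boxes as `(x, y, w, h)`, and the function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = min(ax + aw, bx + bw) - max(ax, bx)
    ih = min(ay + ah, by + bh) - max(ay, by)
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    return inter / (aw * ah + bw * bh - inter)


def fit_line(ts, vs):
    """Closed-form least-squares fit v = slope * t + intercept."""
    n = len(ts)
    mean_t = sum(ts) / n
    mean_v = sum(vs) / n
    var = sum((t - mean_t) ** 2 for t in ts)
    if var == 0:
        return 0.0, mean_v            # a single observation: predict a constant
    slope = sum((t - mean_t) * (v - mean_v) for t, v in zip(ts, vs)) / var
    return slope, mean_v - slope * mean_t


def predict_box(history, frame):
    """Extrapolate each box coordinate of a target to the given frame."""
    ts = [f for f, _ in history]
    coords = []
    for c in range(4):
        slope, intercept = fit_line(ts, [box[c] for _, box in history])
        coords.append(slope * frame + intercept)
    return tuple(coords)


def passes_gate(history, tracklet, threshold=0.1):
    """True if any tracklet detection overlaps the prediction by more than threshold."""
    return any(iou(predict_box(history, f), box) > threshold for f, box in tracklet)
```

The low threshold keeps the gate permissive: it only discards tracklets that are clearly incompatible with the predicted motion, leaving the finer decision to the energy minimization.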

New Target Hypotheses 

Since new targets can enter the scene at any time and at
any location, it is desirable to automatically identify new
targets. Our algorithm can naturally identify new targets
by treating any tracklet in the set $\mathcal{T}$ as a potential new
target. We use non-maximum suppression on tracklets to
avoid having duplicate new targets. For each tracklet $\mathcal{T}_i$,
we simply add an empty target $A_m^{*t-1} = \emptyset$ to $A^{*t-1}$ with an
associated hypothesis set $H_m^t = \{\emptyset, \mathcal{T}_i\}$.

4.3. Inference with Dynamic Graphical Model 

Once we have all the hypotheses for all the new and
existing targets, the problem (Eq. 2) can be formulated as
an inference problem with an undirected graphical model,
where one node represents a target and the states are
hypothesis indices as shown in Fig. 5(c). The main challenges
in this problem are: 1) there may exist loops in the graphical
model representation and 2) the structure of the graph differs
depending on the hypotheses in each circumstance. In
order to obtain the exact solution efficiently, we first analyze
the structure of the graph on the fly and apply appropriate
inference algorithms based on the structure analysis.

Given the graphical model, we find independent subgraphs
(shown as dashed boxes in Fig. 5(c)) using connected
component analysis [17] and run an individual inference
algorithm for each subgraph in parallel. If a subgraph
is composed of more than one node, we use the junction
tree algorithm [21, 26] to obtain the solution for the
corresponding subgraph. Otherwise, we choose the best
hypothesis for the target.
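The decomposition into independent subgraphs can be sketched as below. Note the substitution: for brevity this sketch finds the exact minimum of each subgraph by exhaustive enumeration instead of the junction tree algorithm used in the paper, which gives the same answer on small components but scales far worse. The data layout (`unary[m][k]` for single-target energies, `pairwise[(m, l)][k][q]` for exclusion energies) and the function names are assumptions made for illustration.

```python
from itertools import product


def connected_components(n_nodes, edges):
    """Group node indices into connected components with union-find."""
    parent = list(range(n_nodes))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]   # path halving
            a = parent[a]
        return a

    for a, b in edges:
        parent[find(a)] = find(b)
    groups = {}
    for node in range(n_nodes):
        groups.setdefault(find(node), []).append(node)
    return list(groups.values())


def map_states(unary, pairwise):
    """Exact minimum-energy hypothesis index per target.

    unary[m][k]:           energy of hypothesis k for target m.
    pairwise[(m, l)][k][q]: exclusion energy when m picks k and l picks q.
    Each component is solved independently (brute force stands in for the
    junction tree algorithm here).
    """
    states = [0] * len(unary)
    for comp in connected_components(len(unary), pairwise.keys()):
        best, best_energy = None, float("inf")
        for assignment in product(*(range(len(unary[m])) for m in comp)):
            pick = dict(zip(comp, assignment))
            energy = sum(unary[m][pick[m]] for m in comp)
            energy += sum(
                table[pick[m]][pick[l]]
                for (m, l), table in pairwise.items()
                if m in pick
            )
            if energy < best_energy:
                best, best_energy = pick, energy
        for m, k in best.items():
            states[m] = k
    return states
```

Targets whose hypotheses never overlap form single-node components, so their best hypothesis is picked directly; only groups of competing targets require joint inference, and these groups can be processed in parallel.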

Once the states $\hat{x}$ are found, we can uniquely identify
the new set of targets by augmenting $A^{*t-1}$ with $H^t(\hat{x})$:
$A^{*t-1} + H^t(\hat{x}) \rightarrow A^t$. This process allows us to adjust
any associations of $A^{t-1}$ in [t-r, t] (i.e. addition, deletion,
replacement, or no modification).

5. Model Details 

In this section, we discuss the details of the potentials
described in Eq. 3.

5.1. Unary potential 

As discussed in the previous sections, we utilize the
ALFD metric as the main affinity metric to compare
detections. The unary potential for each detection in the
hypothesis is measured by:

$\mu_A(A_m^{*t-1}, d_i) = -\sum_{\Delta t \in \mathcal{N}} a_A(d(A_m^{*t-1}, t_i - \Delta t), d_i)$    (5)

where $\mathcal{N}$ is a predefined set of neighbor frame distances
and $d(A_m^{*t-1}, t_i - \Delta t)$ gives the associated detection of
$A_m^{*t-1}$ at $t_i - \Delta t$. Although we can define an arbitrarily large
set $\mathcal{N}$, we choose $\mathcal{N} = \{1, 2, 5, 10, 20\}$ for computational
efficiency while modeling long-term affinity measures.

Although the ALFD metric provides very strong information
in most cases, there are a few failure cases, including
occlusions, erroneous IPTs, etc. To complement such
cases, we design an additional Target Dynamics (TD) feature
$\mu_T(A_m^{*t-1}, d_i)$. Using the same polynomial least squares
predictor discussed in Sec. 4.2, we define the feature as
follows:

$\mu_T(A_m^{*t-1}, d_i) = \begin{cases} -\eta^{(t_i - f)} \, o^2(p(A_m^{*t-1}, t_i), d_i), & \text{if } o^2(p(A_m^{*t-1}, t_i), d_i) > 0.5 \\ \infty, & \text{otherwise} \end{cases}$    (6)

where $\eta$ is a decay factor (0.98) that discounts long-term
prediction, $f$ denotes the last associated frame of $A_m^{*t-1}$,
$o^2$ represents the $IoU^2$ measure discussed in Sec. 4.1, and $p$ is
the polynomial least squares predictor described in Sec. 4.2.

Using the two measures, we define the unary potential
$\psi_u(A_m^{*t-1}, d_i)$ as:

$\psi_u(A_m^{*t-1}, d_i) = \min(\mu_A(A_m^{*t-1}, d_i), \mu_T(A_m^{*t-1}, d_i)) - s_i$    (7)

where $s_i$ represents the detection score of $d_i$. The min
operator enables us to utilize the ALFD metric in most cases,
but to activate the TD metric only when it is very confident
(more than 0.5 overlap between the prediction and the
detection). If $A_m^{*t-1}$ is empty, the potential becomes $-s_i$.
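The way the two cues are combined can be sketched in a few lines. The function below is illustrative only and rests on assumptions of our own: the ALFD affinities to the target's detections at the neighbor frame distances are passed in as a list, the TD term is treated as infinite whenever the prediction overlap is not above 0.5 (so that the min falls back to the ALFD term), the decay exponent is the number of frames since the target's last associated detection, and the names are hypothetical.

```python
def unary_potential(alfd_affinities, td_overlap, frames_since_last, score, eta=0.98):
    """Unary energy of assigning one detection to a target (lower is better).

    alfd_affinities:   ALFD affinities between the detection and the target's
                       detections at the neighbor frame distances (may be empty).
    td_overlap:        squared-IoU style overlap between the motion prediction
                       and the detection, or None if no prediction exists.
    frames_since_last: frames since the target's last associated detection
                       (assumed form of the decay exponent).
    score:             detection score s_i.
    """
    mu_a = -sum(alfd_affinities)
    if td_overlap is not None and td_overlap > 0.5:
        mu_t = -(eta ** frames_since_last) * td_overlap   # confident prediction
    else:
        mu_t = float("inf")                               # TD cue inactive
    return min(mu_a, mu_t) - score
```

For a newly entering (empty) target there are no affinities and no prediction, so the function returns the negated detection score, matching the last sentence of the paragraph above.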


5.2. Pairwise potential 


The pairwise potential $\psi_p(\cdot)$ is solely defined by the
ALFD metric. Similarly to the unary potential, we define
the pairwise relationship between detections in $H_{m,x_m}^t$:

$\psi_p(d_i, d_j) = \begin{cases} -a_A(d_i, d_j), & \text{if } |t_i - t_j| \in \mathcal{N} \\ 0, & \text{otherwise} \end{cases}$    (8)

It measures the self-consistency of a hypothesis $H_{m,x_m}^t$.


5.3. High-order potential 


We incorporate a high-order potential to regularize the 
target association process with a physical feasibility and 
appearance similarity. Firstly, inspired by [2, 29], we im¬ 
plement the physical feasibility by penalizing the hypothe¬ 
ses that present an abrupt motion. Secondly, we encodes 
long term appearance similarity between all the detections 
in and 77^ similarly to [29]. The intuition is en¬ 
coded by the following potential: 


+ £• 9-K(di,dj) (9) 

where $\gamma, \epsilon, \theta$ are scalar parameters, $\xi(a, b)$ measures the
sum of squared distances in (x, y, height) of the two boxes,
normalized by the mean height of $p$ in $[t-\tau, t]$, and
$K(d_i, d_j)$ represents the intersection kernel for color
histograms associated with the detections. We use a pyramid
of LAB color histograms where the first layer is the
full box and the second layer is a 3 x 3 grid. Only the A
and B channels are used for the histogram, with 4 bins per
channel (resulting in 4 x 4 x (1 + 9) bins). We use
$(\gamma, \epsilon, \theta) = (20, 0.4, 0.8)$ in practice.
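
To make the appearance term concrete, here is a minimal sketch of the histogram intersection kernel and of the descriptor length implied by the two-layer pyramid. It assumes each descriptor is stored as a flat, L1-normalized list; the pair enumeration and the helper name `appearance_term` are illustrative choices rather than details taken from the paper.

```python
from itertools import combinations

BINS_PER_CHANNEL = 4        # A and B channels, 4 bins each
CELLS = 1 + 3 * 3           # full box plus a 3 x 3 grid
HIST_LENGTH = BINS_PER_CHANNEL * BINS_PER_CHANNEL * CELLS  # 4 * 4 * 10 = 160

EPSILON, THETA = 0.4, 0.8   # values quoted in the text


def intersection_kernel(hist_a, hist_b):
    """Histogram intersection: sum of bin-wise minima (1.0 for identical
    L1-normalized histograms, 0.0 for disjoint ones)."""
    return sum(min(a, b) for a, b in zip(hist_a, hist_b))


def appearance_term(histograms, epsilon=EPSILON, theta=THETA):
    """epsilon * sum over detection pairs of (theta - K(d_i, d_j))."""
    return epsilon * sum(theta - intersection_kernel(a, b)
                         for a, b in combinations(histograms, 2))
```

Under this reading, pairs whose kernel value exceeds the threshold lower the energy and dissimilar pairs raise it.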


KITTI 0001: Cars, Mobile camera

| Metric | Δt = 1 | Δt = 2 | Δt = 5 | Δt = 10 | Δt = 20 |
|--------|--------|--------|--------|---------|---------|
| ALFD   | 0.91   | 0.89   | 0.84   | 0.80    | 0.71    |
| NDist2 | 0.81   | 0.66   | 0.32   | 0.15    | 0.06    |
| HistIK | 0.81   | 0.76   | 0.62   | 0.51    | 0.38    |

PETS09-S2L1: Pedestrians, Static camera

| Metric | Δt = 1 | Δt = 2 | Δt = 5 | Δt = 10 | Δt = 20 |
|--------|--------|--------|--------|---------|---------|
| ALFD   | 0.88   | 0.86   | 0.83   | 0.78    | 0.68    |
| NDist2 | 0.85   | 0.78   | 0.67   | 0.55    | 0.41    |
| HistIK | 0.76   | 0.71   | 0.65   | 0.60    | 0.51    |

Table 1. AUC of affinity metrics for varying Δt. Notice that ALFD
provides a robust affinity metric even at 20 frames distance. The
results verify that ALFD provides a stable affinity measure regardless
of object type or camera motion.






6. Experimental Evaluation

In order to evaluate the proposed algorithm, we use the
KITTI object tracking benchmark [15] and the MOT challenge
dataset [1]. The KITTI tracking benchmark is composed of
about 19,000 frames (≈ 32 minutes). The dataset is composed
of 21 training and 29 testing video sequences that are
recorded using cameras mounted on top of a moving vehicle.
Each video sequence has a variable number of frames,
from 78 to 1176, and a variable number of target
objects (Car, Pedestrian, and Cyclist). The videos are
recorded at 10 FPS. The dataset is very challenging since
1) the scenes are crowded (occlusion and clutter), 2) the
camera is not stationary, and 3) target objects appear in
arbitrary locations with variable sizes. Many conventional
assumptions/techniques adopted in multiple target tracking
with a surveillance camera are not applicable in this case
(e.g. fixed entering/exiting location, background subtraction,
etc.). The MOT challenge is composed of 11,286 frames
(≈ 16.5 minutes) with varying FPS. The dataset is composed
of 11 training and 11 testing video sequences. Some
of the videos are recorded using a mobile platform and the
others are from surveillance videos. All the sequences contain
only Pedestrians. As it is composed of videos with various
configurations, tracking algorithms that are particularly
tuned for a specific scenario would not work well in general.
For the evaluation, we adopt the widely used CLEAR MOT
tracking metrics [19]. For a fair comparison to the other
methods, we use the reference object detections provided
by both datasets.

6.1. ALFD Analysis 

We first run an ablative analysis on our ALFD affinity
metric. We choose two sequences, KITTI's 0001 and
MOT's PETS09-S2L1, both from the training sets, for the
analysis. Given all the detections and the ground truth
annotations, we first find the label association between
detections and annotations. For each detection, we assign the
ground truth id if the overlap is larger than 0.5. We collect all
possible pairs of detections at 1, 2, 5, 10, and 20 frame distances
(Δt) to obtain the positive and negative pairs. As the baseline
affinity measures, we use the L2 distance between the bottom
centers of the detections, normalized by the mean
height of the two (NDist2), and the intersection kernel between
the color histograms of the two (HistIK). Fig. 6 and
Table 1 show the ROC curve and AUC of each affinity metric.
We observe that the ALFD affinity metric performs the best
at all temporal distances regardless of the camera configuration
and object type. As the temporal distance increases,
the other metrics quickly become unreliable as expected,
whereas our ALFD metric still provides a strong cue to
compare different detections.

Figure 6. Corresponding ROC curves for Table 1. The X axis is
False-Positive-Rate and the Y axis is True-Positive-Rate. Notice that
the NDist2 measure quickly becomes unreliable as the temporal
distance increases when the camera is moving.
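
The evaluation protocol of this subsection can be sketched in a few lines of Python. The snippet computes the NDist2 baseline for a pair of boxes and the AUC of any affinity score from its values on positive pairs (same ground truth id) and negative pairs. The box layout `(x, y, w, h)` with a top-left origin and the rank-based AUC formula are assumptions made for illustration, not details taken from the paper.

```python
import math


def ndist2(box_a, box_b):
    """L2 distance between bottom centers, normalized by the mean height.

    Boxes are (x, y, w, h) with (x, y) the top-left corner (assumed layout).
    Smaller values mean the two detections are more likely to match.
    """
    ax, ay = box_a[0] + box_a[2] / 2.0, box_a[1] + box_a[3]
    bx, by = box_b[0] + box_b[2] / 2.0, box_b[1] + box_b[3]
    mean_height = (box_a[3] + box_b[3]) / 2.0
    return math.hypot(ax - bx, ay - by) / mean_height


def auc(pos_scores, neg_scores):
    """Area under the ROC curve: the probability that a positive pair scores
    higher than a negative pair, counting ties as one half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))
```

For a distance such as NDist2, negate the values before calling `auc` so that larger scores indicate a better match.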

6.2. KITTI Testing Benchmark Evaluation 

Table 2 summarizes the evaluation accuracy of our
method (NOMT) and the other state-of-the-art algorithms
on the whole 28 test video sequences^. We also implemented
an online tracking algorithm with the Hungarian
method [22] (HM) using our unary match function. Any
match cost larger than -0.5 is treated as an invalid match. In
the following evaluations, we set the temporal window τ = 10
and filter out targets that either have only one detection or
a median detection score lower than 0. We use the Kalman
Filter [33] to obtain continuous trajectories out of the discrete
detection sets A. Since the KITTI evaluation system does
not provide results on the Cyclist category (due to lack of
sufficient data), we report the accuracy of the Car and Pedestrian
categories. We also run the experiments with more advanced
detection results (HM+[32] and NOMT+[32]).
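
The gating and filtering rules stated above are simple enough to sketch directly. The snippet assumes a target is represented by the list of its detection scores and a cost matrix by nested lists; both layouts and the function names are hypothetical.

```python
from statistics import median


def keep_target(detection_scores, min_detections=2, min_median_score=0.0):
    """Return True if a target survives the filter described in the text:
    drop targets with a single detection or a median score below 0."""
    if len(detection_scores) < min_detections:
        return False
    return median(detection_scores) >= min_median_score


def gate_matches(cost_rows, max_cost=-0.5):
    """Mark match costs larger than -0.5 as invalid (None) before assignment,
    as done for the HM baseline."""
    return [[c if c <= max_cost else None for c in row] for row in cost_rows]
```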

As shown in the table, we observe that our algorithm 
(NOMT) outperforms the other state-of-the-art methods in 
most of the metrics with significant margins. Our method 

^The comparison is also available at
http://www.cvlibs.net/datasets/kitti/eval_tracking.php that includes
other anonymous submissions.
























































Car Tracking Benchmark

| Method         | Type   | Rec. ↑  | Prec. ↑ | F1 ↑    | MOTA ↑  | MOTP ↑  | MT ↑    | ML ↓    | IDS ↓ | FRAG ↓ |
|----------------|--------|---------|---------|---------|---------|---------|---------|---------|-------|--------|
| DPMF [28]      | Batch  | 45.52 % | 96.48 % | 61.85 % | 43.77 % | 78.49 % | 11.08 % | 39.45 % | 2738  | 3241   |
| TBD [14]       | Batch  | 54.47 % | 95.44 % | 69.36 % | 51.73 % | 78.47 % | 13.81 % | 34.60 % | 33    | 540    |
| CEM [25]       | Batch  | 53.75 % | 90.31 % | 67.39 % | 47.81 % | 77.26 % | 14.42 % | 33.99 % | 125   | 401    |
| RMOT [34]      | Online | 55.58 % | 90.06 % | 68.74 % | 49.25 % | 75.33 % | 15.17 % | 33.54 % | 51    | 389    |
| HM             | Online | 62.13 % | 94.06 % | 74.83 % | 57.55 % | 78.79 % | 26.86 % | 30.5 %  | 28    | 253    |
| NOMT           | Online | 67.01 % | 94.02 % | 78.25 % | 62.44 % | 78.32 % | 31.56 % | 27.77 % | 13    | 159    |
| RMOT [34]+[32] | Online | 78.16 % | 82.64 % | 80.34 % | 60.27 % | 75.57 % | 27.01 % | 11.38 % | 216   | 755    |
| HM+[32]        | Online | 78.47 % | 90.71 % | 84.15 % | 69.12 % | 80.10 % | 38.54 % | 15.02 % | 109   | 378    |
| NOMT+[32]      | Online | 80.79 % | 91.00 % | 85.59 % | 71.68 % | 79.55 % | 43.10 % | 13.96 % | 39    | 236    |

Pedestrian Tracking Benchmark

| Method         | Type   | Rec. ↑  | Prec. ↑ | F1 ↑    | MOTA ↑  | MOTP ↑  | MT ↑    | ML ↓    | IDS ↓ | FRAG ↓ |
|----------------|--------|---------|---------|---------|---------|---------|---------|---------|-------|--------|
| CEM [25]       | Batch  | 46.92 % | 81.59 % | 59.58 % | 36.21 % | 74.55 % | 7.95 %  | 53.04 % | 221   | 1011   |
| RMOT [34]      | Online | 50.88 % | 82.51 % | 62.95 % | 39.94 % | 72.86 % | 10.02 % | 47.54 % | 132   | 1081   |
| HM             | Online | 52.28 % | 83.89 % | 64.42 % | 41.67 % | 75.77 % | 11.43 % | 51.65 % | 101   | 996    |
| NOMT           | Online | 59.00 % | 84.44 % | 69.46 % | 47.84 % | 75.01 % | 14.54 % | 43.10 % | 47    | 959    |
| RMOT [34]+[32] | Online | 68.55 % | 80.76 % | 74.16 % | 51.06 % | 74.19 % | 16.93 % | 41.28 % | 372   | 1515   |
| HM+[32]        | Online | 67.58 % | 85.05 % | 75.32 % | 54.46 % | 77.51 % | 17.31 % | 42.32 % | 295   | 1248   |
| NOMT+[32]      | Online | 70.80 % | 86.60 % | 77.91 % | 58.80 % | 77.10 % | 23.52 % | 34.76 % | 102   | 908    |

Table 2. Multiple target tracking accuracy for the KITTI Car/Pedestrian tracking benchmark. ↑ represents that high numbers are better for the
metric and ↓ means the opposite. The best numbers in each column are bold-faced. We use τ = 10 for NOMT and NOMT+[32].


Pedestrian Tracking Benchmark

| Method    | Type   | FP ↓   | FN ↓   | MOTA ↑ | MOTP ↑ | MT ↑   | ML ↓   | IDS ↓ | FRAG ↓ |
|-----------|--------|--------|--------|--------|--------|--------|--------|-------|--------|
| DP [28]   | Batch  | 13,171 | 34,814 | 14.5 % | 70.8 % | 6.0 %  | 40.8 % | 4,537 | 3,090  |
| TBD [14]  | Batch  | 14,943 | 34,777 | 15.9 % | 70.9 % | 6.4 %  | 47.9 % | 1,939 | 1,963  |
| RMOT [34] | Online | 12,473 | 36,835 | 18.6 % | 69.6 % | 5.3 %  | 53.3 % | 684   | 1,282  |
| CEM [25]  | Batch  | 14,180 | 34,591 | 19.3 % | 70.7 % | 8.5 %  | 46.5 % | 813   | 1,023  |
| HM        | Online | 11,162 | 33,187 | 26.7 % | 71.5 % | 11.2 % | 47.9 % | 669   | 916    |
| NOMT      | Online | 7,762  | 32,547 | 33.7 % | 71.9 % | 12.2 % | 44.0 % | 442   | 823    |

Table 3. Multiple target tracking accuracy for the MOT Challenge. ↑ represents that high numbers are better for the metric and ↓ means the
opposite. The best numbers in each column are bold-faced. We use τ = 10 for NOMT.


produces much larger numbers of mostly tracked targets
(MT) in both Car and Pedestrian experiments with smaller
numbers of mostly lost targets (ML). This is thanks to the
highly accurate identity maintenance capability of our
algorithm, demonstrated in the low numbers of identity switches
(IDS) and fragmentations (FRAG). In turn, our method
achieves the highest MOTA, which summarizes all aspects of
tracking evaluation, compared to the other state-of-the-art
methods (> 10% for Car and > 8% for Pedestrian). Notice that the higher
tracking accuracy results in the higher detection accuracy as
shown in the Recall, Precision, and F1 metrics. Our own HM
baseline also performs better than the other state-of-the-art
methods, which demonstrates the robustness of the ALFD
metric. However, due to the nature of pure online association
and the lack of a high-order potential, it ends up missing more
targets as shown in the MT and ML measures.

6.3. MOT Challenge Evaluation 

Table 3 summarizes the evaluation accuracy of our
method (NOMT) and the other state-of-the-art algorithms
on the MOT test video sequences^. The website provides a
set of reference detections obtained using [9].

Similarly to the KITTI experiment, we observe that our 
algorithm outperforms the other state-of-the-art methods 

^The comparison is also available at
http://nyx.ethz.ch/view_results.php?chl=2.


| Dataset             | FPS   | IPT   | CHist | Hypos | Infer | Total   |
|---------------------|-------|-------|-------|-------|-------|---------|
| KITTI (11,095)      | 10.27 | 644.2 | 238.8 | 236.0 | 15.6  | 1,080.2 |
| KITTI+[32] (11,095) | 10.15 | 615.6 | 161.5 | 144.9 | 40.3  | 1,092.5 |
| MOT (5,783)         | 11.5  | 323.4 | 92.7  | 62.1  | 19.6  | 502.5   |

Table 4. Computation time on KITTI and MOT test datasets. The
total number of images is shown in parentheses. We report the
average FPS (images/total) and the time (seconds) spent in IPT
computation (IPT), Color Histogram extraction (CHist), Hypothesis
generation (Hypos) that includes all the potential computations,
and the CRF inference (Infer). Total time includes file I/O (reading
images). The main bottleneck is the optical flow computation in the
IPT module, which can be readily improved using a GPU architecture.

with significant margins. Our method achieves the lowest
identity switch and fragmentation counts while achieving the
highest detection accuracy (lowest False Positives (FP) and
False Negatives (FN)). In turn, our method records the highest
MOTA compared to the other state-of-the-art methods, with a
significant margin (> 14%). The two experiments demonstrate
that our ALFD metric and NOMT algorithm are generally
applicable across application scenarios. Fig. 7 shows some
qualitative examples of our results.
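
As a sanity check on how the counts in Table 3 relate to MOTA, the CLEAR MOT definition MOTA = 1 - (FN + FP + IDS) / GT can be evaluated directly. The total number of ground truth boxes used below (61,440) is an assumption taken from the public MOT benchmark statistics, not a figure stated in this paper.

```python
def mota(false_positives, false_negatives, id_switches, num_gt):
    """CLEAR MOT accuracy: 1 - (FN + FP + IDS) / (number of ground truth boxes)."""
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt


NUM_GT_BOXES = 61440  # assumed size of the MOT test set annotations

# (FP, FN, IDS) copied from Table 3.
nomt = mota(7762, 32547, 442, NUM_GT_BOXES)
hm = mota(11162, 33187, 669, NUM_GT_BOXES)
cem = mota(14180, 34591, 813, NUM_GT_BOXES)

print(f"NOMT {nomt:.1%}, HM {hm:.1%}, CEM {cem:.1%}")
```

Under that assumption the three values round to 33.7%, 26.7%, and 19.3%, which agrees with the MOTA column of Table 3.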

6.4. Timing Analysis 

In order to understand the timeliness of the NOMT
method, we measure the latency by computing the difference
between the detection time ($t_i$ of $d_i$ in $A_m^T$) and the last
















































































[Figure 7 panels. MOT: AVG-TownCentre @ 237, TUD-Crossing @ 70, PETS09-S2L1 @ 306, PETS09-S2L2 @ 140. KITTI Train: 0001 @ 225, 0009 @ 147, 0017 @ 34. KITTI Test: 0007 @ 78, 0010 @ 73, 0016 @ 340.]

Figure 7. Qualitative examples of the tracking results. We show the bounding boxes together with the past trajectories (last 30 and 10
frames for MOT and KITTI, respectively). The color of the boxes and trajectories represents the identity of the targets. Notice that our
method can generate long trajectories with consistent IDs in challenging situations, such as occlusion, fast camera motion, etc. The figure
is best viewed in color.


association time. The last association time is defined as follows: if
a detection $d_i$ is newly added to a target $A_m^t$ or replaces any
other detection $d_j$ (e.g. $t_i = t_j$) in $A_m^{t-1}$ at $t$, $t$ is recorded
as the last association time for $d_i$. If $d_i$ was already in $A_m^{t-1}$, no
change is made to the last association time of $d_i$. The last
association time tells us when the algorithm first recognizes
$d_i$ as a part of $A_m^T$ (the final trajectory output for the
target $m$). The mean and standard deviation are 0.59 ± 1.75
and 0.66 ± 1.87 with [32] for the KITTI test set (84.7% and
83.9% with no latency) and 0.87 ± 2.04 for the MOT test
set (77.6% with no latency). It shows that NOMT is indeed
a near online method.

Our algorithm is not only highly accurate, but also
very efficient. Leveraging parallel computation, we
achieve real-time efficiency (≈ 10 FPS) using a 2.5 GHz
CPU with 16 cores. Table 4 summarizes the time spent in
each computational module.
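
The latency statistics and the FPS figures of Table 4 follow from simple arithmetic, sketched below. The per-detection record layout (detection frame, last association frame) and the use of the population standard deviation are assumptions for illustration; the paper does not state which estimator was used.

```python
from statistics import mean, pstdev


def latency_stats(records):
    """records: iterable of (detection_frame, last_association_frame).

    Returns (mean latency, standard deviation, fraction with zero latency),
    with latency measured in frames."""
    latencies = [assoc - det for det, assoc in records]
    zero_fraction = sum(1 for lat in latencies if lat == 0) / len(latencies)
    return mean(latencies), pstdev(latencies), zero_fraction


def average_fps(num_images, total_seconds):
    """Average throughput as reported in Table 4 (images / total time)."""
    return num_images / total_seconds
```

For example, `average_fps(11095, 1080.2)` reproduces the roughly 10.27 FPS listed for the KITTI row of Table 4.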

7. Conclusion 

In this paper, we propose a novel Aggregated Local Flow
Descriptor (ALFD) that enables us to accurately measure the
affinity between a pair of detections, and a Near-Online
Multi-target Tracking (NOMT) algorithm that takes the advantages
of both the pure online and global tracking algorithms. Our
controlled experiment demonstrates that the ALFD-based affinity
metric is significantly better than other conventional affinity metrics.
Equipped with ALFD, our NOMT algorithm generates significantly
better tracking results on two challenging large-scale
datasets. In addition, our method runs in real time, which
enables us to apply the method in a variety of applications
including autonomous driving, real-time surveillance,
etc.